Skip to content

fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns …#68434

Closed
bramhanandlingala wants to merge 1 commit into
apache:mainfrom
bramhanandlingala:fix/#68416
Closed

fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns …#68434
bramhanandlingala wants to merge 1 commit into
apache:mainfrom
bramhanandlingala:fix/#68416

Conversation

@bramhanandlingala

Copy link
Copy Markdown
Contributor

Summary:
LlamaIndexEmbeddingOperator.execute() always returned "vector": None for every chunk in the output. The root cause is that VectorStoreIndex._get_node_with_embedding() calls node.model_copy() and assigns the embedding on the copy — the original node objects are never mutated. This has been the behavior across all tested llama-index-core versions (v0.10–v0.14).

Fix:
Pre-embed all nodes by calling embed_model.get_text_embedding_batch() and assigning the resulting vectors onto the original node objects before passing them to VectorStoreIndex. Since embed_nodes() inside VectorStoreIndex skips nodes whose .embedding is already populated, this produces no duplicate embedding API calls.

Testing:
Manually reproduced the bug with a mock stub mirroring llama-index's model_copy() behavior
Verified the fix produces correct non-None vectors for all chunks
No changes to public API or operator interface

Related: Fixes #68416

@ashb ashb left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs unit tests please

# ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
# ``.embedding`` is already set, so pre-embedding causes no duplicate
# API calls.
texts = [node.get_content() for node in nodes]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_content() with no argument is MetadataMode.NONE, but llama-index's own embed_nodes() embeds node.get_content(metadata_mode=MetadataMode.EMBED), which includes node metadata and respects excluded_embed_metadata_keys. Checked on llama-index-core 0.14.22: for a node with metadata={"src": "a"} the library embeds 'src: a\n\nhello world' while this embeds just 'hello world'. Since the operator attaches user metadata to every Document and the pre-set embeddings are reused for the persisted index, metadata stops contributing to the vectors entirely. Suggest metadata_mode=MetadataMode.EMBED here so the pre-embed step keeps the same embedding semantics llama-index applies itself.

# ``.embedding`` is already set, so pre-embedding causes no duplicate
# API calls.
texts = [node.get_content() for node in nodes]
vectors = embed_model.get_text_embedding_batch(texts, show_progress=False)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks test_byo_embed_model_bypasses_hook: _byo_embedding() builds its mock with spec=["get_text_embedding", "_get_query_embedding"], so calling .get_text_embedding_batch on it raises AttributeError. The duck-type check in _resolve_embed_model has the same gap, it accepts anything with those two methods but execute now needs a third. Adding get_text_embedding_batch to both the spec and the check would cover it, along with the tests ashb asked for on the new pre-embed path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LlamaIndexEmbeddingOperator returns vector=None for every chunk

3 participants